Fabricating conversational speech data with acoustic models: a program to examine model-data mismatch
Abstract
We present a study of data simulated using acoustic models trained on Switchboard data and then recognized using various Switchboard-trained acoustic models. When we recognize real Switchboard conversations, simple development models give a word error rate (WER) of about 47 percent. If instead we simulate the speech data from word transcriptions of the conversations, obtaining the pronunciations for the words from our recognition dictionary, the WER drops by a factor of five to ten. In a third type of experiment, we use human-generated phonetic transcripts to fabricate data that more realistically represents conversational speech, and obtain WERs in the low 40s, fairly similar to those seen on actual speech data. Taken as a whole, these and other experiments described in the paper suggest that there is a substantial mismatch between real speech and the combination of our acoustic models and the pronunciations in our recognition dictionary. Simulation appears to be a promising tool for understanding and reducing this mismatch, and may prove to be a generally valuable diagnostic in speech recognition research.
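The abstract reports results as WER without defining it. As a minimal sketch, assuming the standard definition (word-level edit distance between reference and hypothesis, divided by the number of reference words), the metric could be computed as follows. The function name `word_error_rate` is hypothetical and is not taken from the paper; the authors' actual scoring tool is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Assumption: whitespace tokenization and equal cost (1) for
    substitutions, deletions, and insertions.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, the "factor of five to ten" drop from a 47 percent baseline corresponds to a WER of roughly 4.7 to 9.4 percent on the simulated data; that range is simple arithmetic on the abstract's figures, not a number reported in the source.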
Similar resources
Resegmentation of SWITCHBOARD
The SWITCHBOARD (SWB) corpus is one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (LVCSR). The high error rates on SWB are largely attributable to an acoustic model mismatch, the high frequency of poorly articulated monosyllabic words, and large variations in pronunciations. It is imperative to improve the quality of segmentations and tr...
Joint Uncertainty Decoding for Robust Large Vocabulary Speech Recognition
Standard techniques to increase automatic speech recognition noise robustness typically assume recognition models are clean trained. This “clean” training data may in fact not be clean at all, but may contain channel variations, varying noise conditions, as well as different speakers. Hence rather than considering noise robustness techniques as compensating clean acoustic models for environment...
A robust compensation strategy for extraneous acoustic variations in spontaneous speech recognition
In this paper, we propose a robust compensation strategy to deal effectively with extraneous acoustic variations for spontaneous speech recognition. This strategy extends speaker adaptive training, and uses hidden Markov models (HMM) parameter transformations to normalize the extraneous variations in the training data according to a set of predefined conditions. A “compact” model and the associ...
Transcription of Russian conversational speech
This paper presents initial work in transcribing conversational telephone speech in Russian. Acoustic seed models were derived from other languages. The initial studies are carried out with 9 hours of transcribed data, and explore the choice of the phone set and use of other data types to improve transcription performance. Discriminant features produced by a Multi Layer Perceptron trained on a ...
Support vector machines for automatic data cleanup
Accurate training data plays a very important role in training effective acoustic models for speech recognition. In conversational speech, in several cases, the transcribed data has a significant word error rate which leads to bad acoustic models. In this paper we explore a method to automatically identify such mislabelled data in the context of a hybrid Support Vector Machine/hidden Markov mod...